Text Mining Methods for Mapping Opinions from Georeferenced Documents
Abstract
With the growing availability of large volumes of textual information on the Web, text mining techniques have been attracting increasing interest. One increasingly relevant text mining problem is the detection of textual expressions that convey opinions on particular topics and services. A second problem, which has also been receiving growing attention, is the identification of the geographic location that best relates to the contents of a particular document. In my MSc thesis, I empirically compared automated techniques, based on language models, for assigning documents to opinion classes and to the geospatial coordinates of latitude and longitude that best summarize their contents. Using this information, I then analyzed the possibility of building thematic maps that portray the incidence of particular classes of opinions, as extracted from documents, in different geographic areas. The different components were validated through extensive experiments, using documents from Wikipedia and reviews from Yelp. The best performing method for geocoding textual documents combines character-based language models with a post-processing technique that uses the coordinates of the 5 most similar training documents, obtaining an average prediction error of 265 kilometers and a median prediction error of just 22 kilometers. For opinion mining, opinions were analyzed on a two-point scale (i.e., polarity of opinion) and on a five-point scale (i.e., considering five degrees of opinion). The best performing methods used character-based language models; on the two-point scale, they achieved an accuracy of 0.80. The best performing method for the five-point scale, based on a hierarchical classifier, achieved an accuracy of 0.50.
Kernel Density Estimation was used to build the thematic maps, and an empirical analysis showed that the maps obtained through automatic extraction accurately represent the geographic distribution of opinions.
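As an illustration of how Kernel Density Estimation could turn geocoded opinions into a map layer, the following is a minimal sketch, not the thesis implementation. It assumes a Gaussian kernel over great-circle distance and a 50 km bandwidth; the abstract does not state which kernel or bandwidth was actually used. The names `haversine_km` and `kde_density` are hypothetical, and the score is a relative (unnormalized) density.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def kde_density(query, points, bandwidth_km=50.0):
    """Relative Gaussian-kernel density at `query` (lat, lon).

    `points` holds the (lat, lon) pairs predicted for documents of one
    opinion class. The result is the mean kernel weight, so it lies in
    [0, 1] and is not a normalized probability density.
    `bandwidth_km` is an assumed value, not one taken from the thesis.
    """
    if not points:
        return 0.0
    total = 0.0
    for lat, lon in points:
        d = haversine_km(query[0], query[1], lat, lon)
        total += math.exp(-0.5 * (d / bandwidth_km) ** 2)
    return total / len(points)


# Made-up coordinates for documents classified as "positive", near Lisbon.
positive_docs = [(38.72, -9.14), (38.75, -9.20), (38.70, -9.10)]

# Evaluating the density on a grid of cells yields one map layer per class.
near = kde_density((38.72, -9.14), positive_docs)
far = kde_density((41.15, -8.61), positive_docs)
```

Evaluating `kde_density` over a regular latitude/longitude grid, once per opinion class, gives the values that a thematic map would color; cells close to many documents of a class score higher than distant ones.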
Similar resources
A model for extracting information from textual documents, based on text mining, in the e-learning domain
As computer networks become the backbone of science and the economy, enormous quantities of documents become available. Text mining techniques are therefore used to extract useful information from textual data. Text mining has become an important research area that discovers unknown information, facts, or new hypotheses by automatically extracting information from different written documents. T...
A review of text mining approaches and their function in discovering and extracting a topic
Background and aim: Four text mining methods are examined, with a focus on understanding and identifying their properties and limitations in subject discovery. Methodology: The study is an analytical review of the literature on text mining and topic modeling. Findings: LSA could be used to classify specific and unique topics in documents that address only a single topic. The other three text min...
HesNegar: Persian Sentiment WordNet
Awareness of others' opinions plays a crucial role in the decision-making process of everyone from ordinary customers to top-level executives of manufacturing companies and various organizations. Today, with the advent of Web 2.0 and the expansion of social networks, a vast number of texts expressing people's opinions have been created. However, exploring the enormous amount of documents, various opi...
Rule Based System for Enhancing Recall for Feature Mining from Short Sentences in Customer Review Documents
This paper discovers rules for enhancing the recall values of sentences containing opinions from customer review documents. It does so by mining features and opinions from different blogs, news sites, and review sites. With the advent of numerous web sites that post online reviews and opinions, there has been exponential growth in user-generated content. Since almost all the contents a...
Document clustering based on ontology and a fuzzy approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, the unsupervised classification of similar documents into different groups, is one of the important techniques of text mining. The most important step...